1 Introduction to the package

The “statgenSTA” (van Rossum (2021)) package is an package for analyzing data of plant breeding experiments with options for using three different modeling engines, “SpATS” (Rodríguez-Álvarez et al. (2017)), “lme4” (Bates et al. (2015)) and “asreml” (Butler (2021)), which are exsiting R packages that perform the analysis of field trials. Therefore, to use “statgenSTA” package for analysis, you need also to install R packages “SpATS,” “lme4” and “asreml.” In addition, this package is allowed to be used for the analysis of different types of experimental designs including incomplete block design, RCBD and spatial models with row column design. But note that spatial models can only be fitted with “SpATS” and “asreml.”

In this tutorial, I will show you step by step how to prepare data for analysis, perform analyses using different modeling engines and extract the results from the models.

2 Prepare data

First, I’ll show you how to import data, clean data and prepare data for analysis. I will use three field trial data sets as examples throughout this tutorial.

2.1 import data

Before importing data sets, we need to first load some packages. In R, we use the function library to do that. The first package we load is “readxl” and this package makes it easy to get data out of Excel and into R. The second one is “dplyr” which is a data manipulation tool. And the third one is the famous “ggplot2” package. We’ll use it to plot the layout of those experiments. The next package is “agridat” which provides a large number of data examples from various types of field trials. We’ll use one of the data sets from it to demonstrate how to perform the analysis for the experiments with alpha lattice design. The last package we load is obviously “statgenSTA” which is our goal of study of this tutorial. It is often convenient to use setwd to set the working directory to the one where you save the data files.

Let’s read the first dataset in. This dataset is stored in an excel file so we can use read_xls function in the package readxl to import the data into R. Since this excel file has some extra rows and columns that we want to exclude (show the excel file), we have to specify the cell range to read from. To see the structure of the data set, we could use str function. (briefly talk about the variables, no replicate, 4 trials)

library(readxl)
library(dplyr)
library(ggplot2)
library(agridat)
library(statgenSTA)
setwd("~/Desktop/Research/field trial")

dat1 <- read_xls("Updated 16-Progeny Row Harvest Data.xls", range="A1:N3723")
str(dat1)
## tibble [3,722 × 14] (S3: tbl_df/tbl/data.frame)
##  $ record   : num [1:3722] 1 11 21 31 41 51 61 71 81 91 ...
##  $ checks   : num [1:3722] 0 0 8 0 0 0 0 0 0 0 ...
##  $ prng     : num [1:3722] 3 4 5 6 7 8 9 10 11 12 ...
##  $ prow     : num [1:3722] 3 3 3 3 3 3 3 3 3 3 ...
##  $ bloc     : num [1:3722] 1 1 1 1 1 1 1 1 1 1 ...
##  $ plot     : num [1:3722] 1 11 21 31 41 51 61 71 81 91 ...
##  $ expt     : chr [1:3722] "16-A-101" "16-A-101" "16-A-101" "16-A-101" ...
##  $ entry    : num [1:3722] 29 46 94 6 51 3 48 76 62 19 ...
##  $ identity1: chr [1:3722] "G2130" "G686" "C7" "G1977" ...
##  $ buac5    : num [1:3722] 18.6 12.2 83.5 58.8 44.6 71.5 73.1 73.8 47.7 68.3 ...
##  $ yield_lb : num [1:3722] 0.65 0.42 3.04 2.15 1.63 2.62 2.67 2.71 1.75 2.52 ...
##  $ uniqueid : num [1:3722] 29055 29065 29075 29085 29095 ...
##  $ harvcomm : logi [1:3722] NA NA NA NA NA NA ...
##  $ comm     : chr [1:3722] "few plants" NA NA "right row very short, thin" ...
levels(factor(dat1$bloc))
## [1] "1" "2" "3" "4"

Next, we import the second dataset. This dataset is a csv file. We use read.csv function to read the data in. Note that here we need to set header = TRUE to let the function recognize the first line as the names of the variables, otherwise, it will be imported as a dataset with only one variable. (briefly talk about the variables, 2 replicates)

dat2 <- read.csv("Resolvable-IBD.csv", header = TRUE)
str(dat2)
## 'data.frame':    1080 obs. of  8 variables:
##  $ Plot    : int  1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
##  $ Rep     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ SubBlock: int  1 1 1 1 1 2 2 2 2 2 ...
##  $ Row     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Column  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Trial   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ GID     : int  6463633 3822784 6463797 6463812 6463832 6464471 6568003 6683296 6681676 6569575 ...
##  $ GYTha   : num  4.88 4.41 5.58 5.83 5.06 ...
levels(factor(dat2$Trial))
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9"

The last dataset, as I mentioned before, is a data example from the agridat package. This dataset comes from a field experiment with 3 replicates, 4 blocks per replicate and is laid out as an alpha lattice design.

data(burgueno.alpha)
dat3 <- burgueno.alpha
str(dat3)
## 'data.frame':    48 obs. of  6 variables:
##  $ rep  : Factor w/ 3 levels "R1","R2","R3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ block: Factor w/ 12 levels "B01","B02","B03",..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ row  : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ col  : int  1 2 3 4 5 6 7 8 1 2 ...
##  $ gen  : Factor w/ 16 levels "G01","G02","G03",..: 16 1 6 10 11 4 15 5 12 3 ...
##  $ yield: int  2556 1361 1567 1797 2753 2089 2531 3144 2189 1864 ...

2.2 plot the layout of the data using ggplot

After we have the data sets read into R, we might want to take a look at the layout of these experiments. I’ll show you how to use ggplot to display the layout. We need to first specify the dataset we want to plot and then set the x axis to be the row variable and y axis to be the column variable. Since we want to see different trials in different color, we simply set color equal to the “bloc” variable. Here, note that instead of directly putting bloc there, we use factor of bloc to encode “bloc” as a categorical variable because the original “bloc” variable is a numeric variable. Then we plus geom_point with point size 2 to make a scatter plot for the row and column coordinates. Besides the layout by trial, we could also see the layout by the sub-experiment by setting color equal to “expt.” Here, since the number of sub-experiments is large, I use my own color palette which contains a broader range of color to differentiate from each other because the colors from the default palette are too similar to allow you to distinguish between the sub-experiments.

Similar to the first dataset, we plot the layout of the second dataset where color is set to be the factor of “Trial” since again “Trial” is a numerical variable. For dataset three, we simply set color equal to “block” because expt is already a factor.

myPalette <- readRDS("palette1.rds")
ggplot(dat1, aes(x=prow, y=prng, color = factor(bloc))) + geom_point(size = 2)

ggplot(dat1, aes(x=prow, y=prng, color = expt)) + geom_point(size = 2) +
  scale_color_manual(values = myPalette)

ggplot(dat2, aes(x=Row, y=Column, color = factor(Trial))) + geom_point(size = 2)

ggplot(dat3, aes(x=row, y=col, color = block)) + geom_point(size = 2) +
  scale_color_manual(values = myPalette)

2.3 clean data

The last thing we want to do is to see if there is any noticeable outliers and remove them. We simply look at the boxplot of the yield variable. This dollar sign is used to select the column for the yield variable “buac5.” As you can see, there is some very small values and even a very large negative value.

boxplot(dat1$buac5, ylab = "yield")

Here, we remove all yield value less than 30 and only keep the ones greater than 30 by filter function in dplyr package. Two ways of doing this: first, put the name of the dataset as the first argument, and then specify the subset of rows; alternatively, we could use this %>% sign here to indicate that the filter function is applied on dat1.

dat1.1 <- filter(dat1, buac5 >= 30)
dat1.1 <- dat1 %>% filter(buac5 >= 30)

Here are the boxplots for dat2 and dat3. Since there are not noticeable outliers, we just keep them there without any changes.

boxplot(dat2$GYTha, ylab = "yield")

boxplot(dat3$yield, ylab = "yield")

3 TD object

The first step when modeling field trial data with the statgenSTA package is creating an object of class TD (Trial Data). This object is used throughout the statgenSTA package as input for analyses. A TD object can be created from a data.frame with the function createTD.

3.1 create TD object

Let’s create the TD objects for these data sets. To create a TD object, we need to first rename columns to default column names used by the functions in the statgenSTA package. For example, the column in the data containing genotype is originally called “identity1” and is renamed to “genotype” and converted to a factor. The columns “expt” is also renamed and converted likewise. The columns “prow” and “prng” are renamed to “rowCoord” and “colCoord,” respectively. The TD object created for the first data set is a list with four items, one for each trial in the original data frame. The information about which columns have been renamed when creating a TD object is stored as an attribute of each individual data.frame in the object.

datTD1 <- createTD(data = dat1.1, genotype = "identity1", trial = "bloc", rowCoord = "prow",
                   colCoord = "prng", plotId = "plot", subBlock = "expt")

If we want to fit a model as a single trial, we need to create a TD object that identifies the whole data set as a single trial. To do that, we need to first add a new column to the original data set with a single value. Then in createTD function, we set trial equal to the name this new column.

dat1.1$bloc2 <- 1
datTD1_single <- createTD(data = dat1.1, genotype = "identity1", trial = "bloc2", rowCoord = "prow", 
                          colCoord = "prng", plotId = "plot", subBlock = "bloc")

If you want to recall the names of the trials, you can simply use names(datTD1). This will give you the names of all trials.

names(datTD1)
## [1] "1" "2" "3" "4"

After creating a TD object, data for new trials can be added to it usingaddTD. This function works in exactly the same way as createTD except that it adds data to an existing TD object instead of creating a new one. Dropping one or more trials from a TD object can be done using the function dropTD.

3.2 Summarizing a TD object

If we want to get an idea of the content of the data in the TD object, the summary function can be used. Since we have multiple trials, we need to specify which trial we want to summarize the data of. Here, traits is obviously the yield variable “buac5.” Using the default options, nine summary statistics are printed, but many more are available. The default plot creates plots for the layout of all trials in the TD object. This can be restricted to selected trials using the trials parameter.

summary(datTD1, trial = "1", traits = "buac5")
## 
## Summary statistics for buac5 in 1  
## 
##                            buac5
## Number of observations      1143
## Number of missing values       0
## Mean                       73.71
## Median                     73.10
## Min                        30.80
## Max                       112.00
## First quantile             64.40
## Third quantile             84.45
## Variance                 206.164
plot(datTD1, trials = "2")

This plot can be extended by highlighting interesting genotypes in the layout. It is also possible to color the subBlocks within the field by setting colorSubBlock = TRUE. Highlighting genotypes and coloring subBlocks cannot be done simultaneously. If both options are specified, only highlighting is done. Finally, it is possible to add the names of the genotypes to the layout using argument showGeno.

plot(datTD1, trials = "2", highlight = c("C2", "G1012"))

plot(datTD1, trials = "2", colorSubBlock = TRUE)

plot(datTD1, trials = "2", showGeno = TRUE)

Boxplots can also be made to get an idea of the contents of the data in the TD object. By default a box is plotted per trial in the data for the specified traits. Boxplots for multiple traits can be made at once.

### boxplot by field
plot(datTD1, plotType = "box", traits = "buac5")

Next, let’s create the TD object for the second dataset. Same as the first one, we need to rename the columns. For this dataset, we have replicates, so “Rep” is renamed to “repID.”

datTD2 <- createTD(data = dat2, genotype = "GID", trial = "Trial", repId = "Rep",
                   rowCoord = "Row",colCoord = "Column", plotId = "Plot",
                   subBlock = "SubBlock")

Then let’s get a summary of this TD object. Note that it is possible to summarize the data in a TD object for different groups. This can be done using the parameter “groupBy.” Here, we group it by “repId.” It will display three main summary statistics per group. Again, you can use the parameter what to see more statistics.

### summarize the data for a specific field
summary(datTD2, trial = "1", traits = "GYTha")
## 
## Summary statistics for GYTha in 1  
## 
##                          GYTha
## Number of observations     120
## Number of missing values     0
## Mean                      5.21
## Median                    5.30
## Min                       2.44
## Max                       6.27
## First quantile            4.83
## Third quantile            5.76
## Variance                 0.492
summary(datTD2, trial = "1", traits = "GYTha", groupBy = "repId")
## 
## Summary statistics for GYTha in 1 grouped by repId 
## 
##                            1     2
## Number of observations    60    60
## Mean                    5.19  5.22
## Standard deviation     0.767 0.634

Again, we can plot the TD object. Also notice that the trials in the boxplot can be grouped using the parameter “groupBy.” For example, if we have data from multiple years, we can group it by year. Colors can be applied to groups within trials using the parameter colorTrialBy. Here we want to color the boxes by “repId.”

### plot the layout
plot(datTD2, trials = "1", colorSubBlock = TRUE)

### boxplot by field
plot(datTD2, plotType = "box", traits = "GYTha", colorTrialBy = "repId")

Finally, we create the TD object for the third dataset. This dataset does not have multiple fields so we don’t have a column for trial. However, when fitting a model using fitTD function, an error messege will pop out if “trial” variable is not there. This seems to be a bug and we’ve already submitted a bug report on GitHub. For now, we have to create a column for trial and give whatever value to it. Here, I use 1.

dat3$trial <- 1
datTD3 <- createTD(data = dat3, genotype = "gen", rowCoord = "row", repId = "rep", 
                   colCoord = "col", subBlock = "block")

### summarize the data
summary(datTD3, traits = "yield")
## 
## Summary statistics for yield in 1  
## 
##                               yield
## Number of observations           48
## Number of missing values          0
## Mean                        1330.85
## Median                      1330.50
## Min                          228.00
## Max                         3144.00
## First quantile               933.75
## Third quantile              1681.25
## Variance                 440184.255
### plot the layout
plot(datTD3)

plot(datTD3, colorSubBlock = TRUE)

### boxplot by field
plot(datTD3, plotType = "box", traits = "yield")

4 Model analysis

After creating a TD object, a model can be fitted on the trial data. This is done using the function fitTD. The exact model fitted depends on the design of the trial. The design can be given as a parameter in the function. The output of fitTD is an object of class STA (Single Site Analysis), a list of fitted models with one item for each trial the model was fitted for.

4.1 Model types

As I mentioned in the introduction of the package, fitTD uses three different engines for fitting the models, namely SpATS, lme4 and asreml. For models with row column or resolvable row column design, SpATS is the default engine, for the other models lme4. This can always be overruled by specifying the function parameter engine. Models can be fitted for five different trial designs. These are listed in the following table with their respective model specifications. In the models above, fixed effects are indicated in italics whereas random effects are indicated in bold. Genotype can be fitted as fixed or as random effect depending on the value of the parameter what. Extra fixed effects may be fitted using the parameter covariates.

4.2 Spatial models

4.21 row column design

Using the TD object datTD from the previous part, a model for the trial 1 and trait “buac5” can now be fitted on the data. The trial was set up using a row column design. This is specified in fitTD using the parameter “design.” Here we use engine “SpATS” which is the default engine for fitting a spatial model. Here, we want to use genotype as a fixed effect, so we specify the parameter what to be “fixed.”

model1 <- fitTD(TD = datTD1, traits = "buac5", design = "rowcol", engine = "SpATS",
                what = "fixed")

model1_single <- fitTD(TD = datTD1_single, traits = "buac5", design = "rowcol", 
                       engine = "SpATS", what = "fixed")

Note that when using SpATS as a modeling engine for fitting a model, an extra spatial component is always included in the model. This spatial component is composed using the PSANOVA function in the SpATS package which uses 2-dimensional smoothing with P-splines. The parameters nseg and nest.div of PSANOVA can be modified using the control parameter in fitTD. Fitting the same model but specifying the number of segments for both rows and columns as 8 works as follows.

model1.1 <- fitTD(TD = datTD1, traits = "buac5", design = "rowcol", engine = "SpATS",
                  what = "fixed", control = list(nSeg = c(8, 8)))

4.22 resolvable row column design

Next, let’s fit a model to the dataset 2. This data set comes from an experiment with resolvable row column design so we set design to be “res.rowcol.” And now instead “SpATs” we use the engine “asreml” to fit the model and note that to fit a spatial model we have to set the parameter spatial equal to TRUE. In this case seven models are fitted and the best model, based on a goodness-of-fit criterion, either AIC or BIC, is chosen. The criterion to be used can be specified using the control parameter in fitTD. Fitting a model similar to the one above using asreml with BIC as goodness-of-fit criterion works as follows

model2 <- fitTD(TD = datTD2, traits = "GYTha", design = "res.rowcol", spatial = TRUE,
                trial = "1", engine = "asreml", control = list(criterion = "BIC"), 
                what = "fixed")
## Online License checked out Sun Apr 18 22:00:18 2021

4.3 Non-spatial models - Incomplete block design

The last dataset is from an experiment with Alpha lattice desgian, so we use incomplete block design by spectifying design as “ibd.” To fit a non-spatial model, we use “lme4” as the engine. You can also use “asreml” and set spatial parameter to be FALSE. Note that here by not supplying the what argument to the function, two models are fitted, one with genotype as a fixed effect and one with genotype as a random effect. The results of both these models are stored in the STA object model3. This is very useful for extracting different results from the model later on. A trade-off is that fitting two models takes more time than fitting only one so when fitting models on large datasets it is sensible to explicitly define what if only a subset of the results is needed as output.

model3 <- fitTD(TD = datTD3, traits = "yield", design = "ibd", engine = "lme4")

4.4 Model summary

To get a summary of the fitted model for each trial, we could again use summary function. By setting nBest = 5, the genotypes with their corresponding BLUEs and standard errors are displayed.

summary(model1, nBest = 5, trial="2")
## Summary statistics 
## ==================
## 
## Summary statistics for buac5 in 2  
## 
##                            buac5
## Number of observations      1204
## Number of missing values       0
## Mean                       78.26
## Median                     79.50
## Min                        30.80
## Max                       112.00
## First quantile             68.60
## Third quantile             88.80
## Variance                 210.829
## 
## 
## Predicted means (BLUEs) 
## ===============================
## Best 5 genotypes
##        BLUEs     SE
## G1847 112.75 10.738
## G1698 112.18 10.598
## G812  111.02 10.613
## G1846 110.04 10.785
## G421  109.11 10.257
summary(model2, nBest = 5, trial="1")
## Overview of tried spatial models 
## ================================
##     spatial random    AIC    BIC      H2      row     col      error
## 1      none   <NA> 23.611 31.973 0.39576       NA      NA 2.3875e-01
## 2  AR1(x)id   <NA> 25.387 36.537 0.38288 0.060430      NA 2.4272e-01
## 3  id(x)AR1   <NA> 22.482 33.632 0.44206       NA 0.30850 2.7395e-01
## 4 AR1(x)AR1   <NA> 24.060 37.998 0.44362 0.086215 0.33058 2.8425e-01
## 5  AR1(x)id  units 25.387 36.537 0.38287 0.060426      NA 3.8836e-07
## 6  id(x)AR1  units 22.482 33.632 0.44208       NA 0.30851 4.3833e-07
## 7 AR1(x)AR1  units 17.951 31.889 0.38638 0.527120 0.94684 2.0404e-01
##   correlated error converge
## 1               NA     TRUE
## 2               NA     TRUE
## 3               NA     TRUE
## 4               NA     TRUE
## 5          0.24272     TRUE
## 6          0.27395     TRUE
## 7          0.24029     TRUE
## 
## Selected spatial model:  AR1(x)AR1 
## 
## Summary statistics 
## ==================
## 
## Summary statistics for GYTha in 1  
## 
##                          GYTha
## Number of observations     120
## Number of missing values     0
## Mean                      5.21
## Median                    5.30
## Min                       2.44
## Max                       6.27
## First quantile            4.83
## Third quantile            5.76
## Variance                 0.492
## 
## 
## Predicted means (BLUEs) 
## ===============================
## Best 5 genotypes
##          BLUEs     SE
## 6464471 6.1800 0.4362
## 6681737 6.1090 0.4310
## 6681827 6.0858 0.4286
## 4902859 6.0739 0.4311
## 6683445 5.9931 0.4266
## 
## Standard Error of Difference (genotype modeled as fixed effect) 
## ===============================================================
##       GYTha
## min  0.4621
## mean 0.5131
## max  0.5536
## 
## Least Significant Difference (genotype modeled as fixed effect) 
## ===============================================================
##       GYTha
## min  0.9247
## mean 1.0267
## max  1.1078

For model 2 fitted using “asreml,” we can see the additional summary statistics by this line of code. The fitted models and the best model are stored in the output together with a summary table with details on the fitted models. The best model, the model with the lowest BIC, is the first one as shown in the table.

print(model2$`1`$sumTab$GYTha, digits = 2, row.names = FALSE)
##    spatial random AIC BIC   H2   row  col   error correlated error converge
##       none   <NA>  24  32 0.40    NA   NA 2.4e-01               NA     TRUE
##   AR1(x)id   <NA>  25  37 0.38 0.060   NA 2.4e-01               NA     TRUE
##   id(x)AR1   <NA>  22  34 0.44    NA 0.31 2.7e-01               NA     TRUE
##  AR1(x)AR1   <NA>  24  38 0.44 0.086 0.33 2.8e-01               NA     TRUE
##   AR1(x)id  units  25  37 0.38 0.060   NA 3.9e-07             0.24     TRUE
##   id(x)AR1  units  22  34 0.44    NA 0.31 4.4e-07             0.27     TRUE
##  AR1(x)AR1  units  18  32 0.39 0.527 0.95 2.0e-01             0.24     TRUE

Since genotype has been fitted both as fixed and as random factor in model 3, it is possible to calculate both the Best Linear Unbiased Estimators (BLUEs) and the Best Linear Unbiased Predictors (BLUPs). Therefore both are printed in the summary of the model together with their respective standard errors.

summary(model3, nBest = 10)
## Summary statistics 
## ==================
## 
## Summary statistics for yield in 1  
## 
##                               yield
## Number of observations           48
## Number of missing values          0
## Mean                        1330.85
## Median                      1330.50
## Min                          228.00
## Max                         3144.00
## First quantile               933.75
## Third quantile              1681.25
## Variance                 440184.255
## 
## 
## Estimated heritability 
## ======================
## 
## Heritability: 0.4 
## 
## Predicted means (BLUEs & BLUPs) 
## ===============================
## Best 10 genotypes
##       BLUEs      SE   BLUPs     SE
## G11 1987.09  276.05 1686.76 193.50
## G15 1838.99  276.05 1632.04 193.50
## G05 1746.34  274.24 1571.16 192.93
## G10 1712.12  274.24 1579.95 192.93
## G12 1583.12  276.05 1482.40 193.50
## G04 1543.90  274.24 1446.68 192.93
## G03 1378.00  276.05 1365.47 193.50
## G16 1364.62  274.24 1341.63 192.93
## G14 1348.01  276.05 1370.34 193.50
## G09 1328.65  276.05 1341.76 193.50

Please note that, when fitting a complicated model using fitTD, it is possible that the algorithm fail to converge. In this case, it’s a good idea to revert to a simpler model.

4.5 Model Plots

Again, like TD object, in addition to numeric summaries of a fitted model, the package also provides plots for the fitted models. Indeed, two types of plots can be made for fitted models.

The first plot is the base plot. It is a series of four plots, a histogram of the residuals, normal quantiles of the residuals, a scatter plot of residuals against fitted values and a scatter plot of absolute value of residuals against fitted values. These plots can be made by calling plot on the STA object. Plots can be made for multiple trials and multiple traits simultaneously. By default plots are made for all trials and all traits, but this can be restricted using the parameters trials and traits. Here, for simplicity, we only plot the model 1 for trial 2.

plot(model1, plotType = "base", trials = "2")

plot(model1_single, plotType = "base")

By specifying plotType = "spatial", we get the second type of plot which consists of five plots, spatial plots of the raw data, fitted values, residuals and either BLUEs or BLUPs, and a histogram of the BLUEs or BLUPs. If SpATS was used for modeling an extra plot with the fitted spatial trend is included. Note that spatial plots can only be made if spatial information, i.e. rowCoord and colCoord, is available in the TD object.

plot(model1, plotType = "spatial", trials = "2")

plot(model1_single, plotType = "spatial")

For model 2, we create plots for trial 1. Note that for the spatial plot, the spatial trend can be displayed on the original scale (spaTrend = "raw") or as percentege (spaTrend = "percentage"). If spaTrend = "percentage", the spatial trend is scaled, that is, divided by the average of the observed response variable of interest across the field and displayed as percentage.

plot(model2, plotType = "base", trials = "1")

plot(model2, plotType = "spatial", trials = "1", spaTrend = "percentage")

Here are the plots for model 3. Recall that for model 3, models with genotype as fixed effect and random effect are both fitted, in this case as a default the results will be plotted for the model with genotype fixed. If only one model is fitted, like model 1 and 2, the results of the fitted model will be plotted. This can be changed using the parameter what. Here, let’s plot the result for genotype random.

plot(model3, plotType = "base", what = "random")

4.4 Outlier detection

After fitting a model, it is possible to run an outlier detection on the results. This outlier detection checks the residuals in the fitted model and compares them to a limit. Observations with a residual higher than this limit are marked as outliers. The default limit is calculated based on the number of observations in the data. Setting a custom limit is also possible.

The output of the outlier detection function is a data.frame with the outliers and of vector of indices of the outliers in the original data. This vector can be used for removing outliers from the data. It is possible to run outlier detection on a model with genotype fitted as fixed, but when doing so it is only possible to detect outliers in genotypes that are replicated in the data. For unreplicated genotypes the residual will always be 0 and so these will never be marked as outliers. It’s fine if not specifying what= "fixed" because we only considered genotype as fixed effect when fitting the model. By setting verbose = F, the output will be suppressed and won’t be printed out.

outliers <- outlierSTA(model1, traits = "buac5", what = "fixed", verbose = F)

To get an idea of the magnitude of the outliers, the parameter commonFactors can be used to see the values of observations with similar characteristics. Here, for all outliers also the other observations for the same genotype are shown. Also, we set the limit for determining when a value is considered an outlier to be 3.

outliers <- outlierSTA(model1, traits = "buac5", rLimit = 3, commonFactors = "genotype", verbose = F)

4.5 Model reports

For STA objects there is a report function available in the statgenSTA package. This function creates a pdf report summarizing the main results of the fitted model. Also the tex file and figures used for generating the pdf report are saved.

The parameter outfile is used to change the name and location of the report. The value of this parameter should be a valid location and name for a pdf file, i.e. including the postfix “.pdf.” Non-existing directories are created by the report function. When an outfile is supplied, trial, trait and modeltype (fixed or random) are concatenated to it in the name of the report. here, let’s generate a report for model3. And again we can use the what argument to specify which model we want to report, i.e., genotype random or genotype fixed.

## Create a report
report(model3, outfile = "./Reports/model3_Report.pdf", what = "random")

The reports contains general information on the model fitted, a summary of the results, the plots described in the previous section, a list of best (highest BLUEs or BLUPs) genotypes and a scatter plot of all genotypes and their BLUEs or BLUPs.

4.6 Extracting model results

After fitting a model, various results can be extracted or calculated from the fitted model using the function extract. This can be anything from a single result for one trait and one trial to a list of different results for all models in an STA object. The results that can be extracted depend on the type of model fitted and sometimes on the modeling engine as well. For example, BLUEs can only be extracted if genotype was fitted as fixed effect. On the other hand, BLUPs and heritabilities can only be calculated and extracted if genotype was fitted as random effect.

All results that can be extracted are shown in the table below. In the first column is the result. This is also the value to be used for the parameter what in extract needed to extract the corresponding result. The second column shows what model needs to be fitted in order to be able to extract the result. Here F stands for genotype as fixed effect and R for genotype as random effect. The last column gives a short description of the result that will be extracted and, where needed, also states for which modeling engines it can be extracted.

The following two lines of codes extract BLUEs from model1 and BLUPs from model3. The extracted data is a list of items with BLUEs or BLUPs from each trial.

model1_blues <- extractSTA(STA = model1, what = "BLUEs")
model3_blups <- extractSTA(STA = model3, what = "BLUPs")

The variance components for the model with genotype as random effect can be extracted from model3 by letting what = "varCompR".

model3_vars <- extractSTA(STA = model3, what = "varCompR")

4.7 Additional Analysis

We can do ANOVA test for the difference of trial means with checks removed as follows.

### find checks
dat1_checks <- dat1.1 %>% filter(checks != 0)
geno_checks <- unique(dat1_checks$identity1)

### remove checks
m1_blues_nochecks <- model1_blues %>% filter(!(genotype %in% geno_checks))

### one-way anova test
fit <- lm(buac5 ~ trial, data = m1_blues_nochecks)
anova(fit)
## Analysis of Variance Table
## 
## Response: buac5
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## trial        3  27089  9029.8  53.707 < 2.2e-16 ***
## Residuals 3208 539363   168.1                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can even do pairwise comparisons between trials using emmeans function in “emmeans” pacakge.

library(emmeans)

m1_blues_emm <- emmeans(fit, ~trial)
m1_blues_emm
##  trial emmean    SE   df lower.CL upper.CL
##  1       74.2 0.411 3208     73.4     75.0
##  2       77.3 0.404 3208     76.5     78.1
##  3       80.3 0.408 3208     79.5     81.1
##  4       84.4 0.972 3208     82.5     86.3
## 
## Confidence level used: 0.95
pairs(m1_blues_emm)
##  contrast estimate    SE   df t.ratio p.value
##  1 - 2       -3.08 0.576 3208  -5.346 <.0001 
##  1 - 3       -6.09 0.579 3208 -10.504 <.0001 
##  1 - 4      -10.19 1.055 3208  -9.661 <.0001 
##  2 - 3       -3.01 0.574 3208  -5.234 <.0001 
##  2 - 4       -7.11 1.052 3208  -6.760 <.0001 
##  3 - 4       -4.11 1.054 3208  -3.897 0.0006 
## 
## P value adjustment: tukey method for comparing a family of 4 estimates

5 Calculate moving means

To calculate the moving means, we use the R package “mvngGrAd” (Technow (2015)). I wrote a wrapper function movingG() which uses movingGrid() function in the mvngGrAd package but takes similar arguments as createTD() and fitTD() in the statgenSTA package.

For the arguments traits, trialVar, rowCoord and colCoord, a character string of the name of the corresponding column has to be put there. Please note that trialVar is for the column name of the “trial” variable and trial is a number/character indicating the trial on which we want to use moving grid adjustments. In default, trial = NULL which calculate the moving means using the whole data with all the trials.

The shape of the grid must be designed by the user. This is done via three arguments to the function movingG(), shapeCross, layers and excludeCenter. With the help of shapeCross one specifies the cells that are to be included in a grid that extends from the center in 0, 90, 180 and 270 degree direction, with layers the cells that extend the grid in all other directions. The two can be used together or alone. With the argument excludeCenter one can decide whether or not to include the observation you want to make adjustment in the calculation of the moving mean.

The argument to shapeCross is given in form of a list with four components, each representing one of the directions.

library(mvngGrAd)

movingG <- function(data, traits, trialVar, trial=NULL, rowCoord, colCoord, shapeCross, layers, excludeCenter=TRUE) {
  if (is.null(trial)){
    mvg <- movingGrid(rows = data[[rowCoord]], columns = data[[colCoord]], obsPhe = data[[traits]], 
                      shapeCross = shapeCross, layers)
  } else {
    data1 <- data[data[[trialVar]]==trial, ]
    mvg <- movingGrid(rows = data1[[rowCoord]], columns = data1[[colCoord]], obsPhe = data1[[traits]], 
                      shapeCross = shapeCross, layers)
  }
  return(mvg)
}

The following codes calculate the moving means for the first data set (dat1) with “buac5” as the trait.

mvg_whole <- movingG(data = dat1.1, traits = "buac5", trialVar = "bloc", rowCoord = "prow", 
                     colCoord = "prng", shapeCross = list(1:2, 1:2, 1:2, 1:2), layers = 1)

The moving means for the trial 1 in dat1 can be calculate by setting trial=1.

mvg_trial1 <- movingG(data = dat1.1, traits = "buac5", trialVar = "bloc", trial = 1, rowCoord = "prow", 
                      colCoord = "prng", shapeCross = list(1:2, 1:2, 1:2, 1:2), layers = 1)

For designing the grid, function sketchGrid( ) takes the same arguments as movingGrid( ) and plots a visualization of the grid. This function is usually used to help with designing the grid by plotting its shape, that is, it can be used to verify that the actual arguments to shapeCross and layers do create the intended grid.

And here is how the grid we created for dat1 looks like.

sketchGrid(i = 10, j = 10, rowLimit = 20, colLimit = 20, shapeCross = list(1:2, ## down 
                                                                           1:2, ## up
                                                                           1:2, ## left
                                                                           1:2),## right
           layers = 1, excludeCenter = TRUE)

A summary of the adjustment process can be obtained by the function summary().

summary(mvg_whole)
## 
## Adjustment by function: movingGrid 
## 
## Maximum possible number of values in the grid: 12 
## 
## Mean number of values in grid: 11.31 
## 
## Number of observation: 3699 , number of NA-observations: 0 
## 
## Coefficient of correlation between moving means
## and observed phe. values: 0.59 
## 
## Regression coefficient of moving means regressed
## on observed phe. values : 0.97 
## 
## Experimental layout:
##      rows: 40 
##  columns: 125

The first line names the function used. In the second line the number of values (cells) in a complete grid is given. The mean number of values in the third line is always smaller, because for entries close to the edge of the experimental area only partial grids can be constructed. NA values also have this effect. The fourth line gives the number of observations§ and how many of these were NA. The Pearson coefficient of correlation between moving means and observed phenotypic values gives information on the strength of the influence of the covariate. If this value is too low, an adjustment will not yield better estimates than by using the observed value directly. In rare cases, for example under strong inter plant competition, the correlation coefficient can be below zero. In such a case one should not perform an adjustment. The next line gives the regression coefficient used for the adjustment.

The function entryData() extracts all information on an entry available.

head(entryData(mvg_whole))
##   row column adjustedPhe observedPhe movingMean nValues
## 1   3      5      89.419        83.5     72.040       5
## 2   3      6      68.437        58.8     68.200       6
## 3   3      7      51.381        44.6     71.150       8
## 4   3      8      82.444        71.5     66.850       8
## 5   3      9      88.389        73.1     62.362       8
## 6   3     10      88.242        73.8     63.237       8

The adjusted values can be obtained by using the function fitted(). Here are the first 10 adjusted values.

fitted(mvg_whole)[1:10]
##  [1] 89.419 68.437 51.381 82.444 88.389 88.242 53.149 72.781 94.459 82.095

The moving means are extracted by movingMean().

movingMean(mvg_whole)[1:10]
##  [1] 72.0400 68.2000 71.1500 66.8500 62.3625 63.2375 72.5250 73.5250 77.1625
## [10] 86.2125

The residuals for the model are returned by the function residuals().

residuals(mvg_whole)[1:10]
##          1          2          3          4          5          6          7 
##  11.251641  -9.730077 -26.786570   4.277131  10.222388  10.075124 -25.017986 
##          8          9         10 
##  -5.386288  16.291511   3.928374

References

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.
Butler, David. 2021. Asreml: Fits the Linear Mixed Model. www.vsni.co.uk.
Rodríguez-Álvarez, María Xosé, Martin P. Boer, Fred A. van Eeuwijk, and Paul H. C. Eilers. 2017. “Correcting for Spatial Heterogeneity in Plant Breeding Experiments with p-Splines.” Spatial Statistics 23: 52–71. https://doi.org/10.1016/j.spasta.2017.10.003.
Technow, Frank. 2015. R Package mvngGrAd: Moving Grid Adjustment in Plant Breeding Field Trials.
van Rossum, Bart-Jan. 2021. statgenSTA: Single Trial Analysis (STA) of Field Trials. https://CRAN.R-project.org/package=statgenSTA.